The Problem

“When it’s shown as an average number of years in school and levels of achievement, the developing world is about 100 years behind developed countries.” (Winthrop and McGivney 2015, Brookings Report on Global Educational Inequality)

Despite immense efforts to bolster educational attainment worldwide, the world is facing a learning crisis, particularly in the Global South. In response, development efforts are moving towards more inclusive educational practices. Yet there is a lack of accessible data for understanding how education varies subnationally in a way that accounts for the histories of repression and political exclusion that have barred certain groups from equal access to education. For example, while governments and policy-makers debate how education can be inclusive of diverse languages and ethnicities, no readily available data tracks ethnic groups’ educational attainment across time and space.

Much of the data on global education rates is aggregated at the country level, which leaves no opportunity to account for sub-national variation. Country-level data may not be representative of an entire population. For example, a country may have an average of 7 completed years of education, but that does not mean education is evenly distributed across the population. We should expect groups with histories of political marginalization to have lower education rates; however, little data currently allows for demonstrating this cross-nationally. In other words, while the above quote offers a shocking comparison between the Global South and North, it may actually understate the extent of educational inequality for groups that have been historically marginalized within countries.

The Solution: Ethnic Group Education (EGE) Dataset

The following is an Rmarkdown document that highlights the creation of the Ethnic Group Education dataset (EGE). The EGE is a dataset that will ultimately include all major ethnic groups per country-year (1969-2015) and their educational attainment, as no such dataset/information readily exists.

As a first cut at such a dataset and a proof of concept, I construct the EGE for 35 countries in Africa.

In short, the dataset is constructed by:

  1. Taking Afrobarometer survey waves 4-6 and merging them into one dataset.
  2. Using the Linking Ethnic Data from Africa (LEDA) package in R to link respondents’ languages to ethnic groups across Afrobarometer waves. Specifically, I link individuals to ethnic groups as listed in the Ethnic Power Relations (EPR) dataset, which includes country-year information on ethnic groups and their relative political status (Monopoly, Dominant, Senior Partner, Junior Partner, Powerless, Discriminated, Irrelevant).
  3. Using individual levels of education from 2008-2016 to backtrack educational attainment averages per ethnic group per year.

The end result is a dataset that contains country-year information on every major ethnic group in 35 African countries and their corresponding:

  1. Average educational attainment for each year from 1969 to 2015
  2. The corresponding ethnic group information (i.e. is the ethnic group excluded from power, dominant, etc.)

Relevance & Need of EGE

Policy-makers and political scientists studying authoritarian regime maintenance illustrate how education invites both risk and reward for non-democratic states. Education increases pro-democratic attitudes, political dis-engagement, and ultimately autocratic failure. At the same time, political elites in authoritarian countries are predicted to be hesitant to invest in disenfranchised populations. However, education has also been found to bolster national loyalty, human capital, and long-term development. Nor is the real-world pattern clear: non-democratic countries display significant variation in educational investment and attainment, as well as varied relationships between education and political participation. The question, “When do policy makers and political elites in non-democratic states support meaningful education efforts?” remains contested at best.

My research begins to answer this question by investigating two factors: the ethnic diversity of the country and the extent to which the government uses propaganda in schools. Education does not have a uniform effect. Education will not instill similarly pro-democratic attitudes across a diverse population - even if the education “treatment” is constant. In other words, similar educational policies and initiatives across non-democratic countries can have opposite outcomes - jeopardizing or strengthening political stability. Similarly, at the individual level, increased education can lead to individuals becoming supportive of or opposed to their governments.

My research currently focuses on three inter-related questions:

  1. When does education strengthen or weaken national identity?
  2. When does education lead to autocratic stability or democratization?
  3. Do inclusive educational policies that recognize previously marginalized cultures/languages foster a shared or divisive national identity?

The following data construction effort allows us to begin answering each of these questions.

Code & Creation of EGE

The following highlights the construction of the dataset, and then provides some preliminary figures/information using the dataset.

First, I load in all necessary packages and three waves of the “Afrobarometer” surveys. Afrobarometer is a regional survey initiative that asks over 2000 respondents per country in Africa identical questions, and is similar to many other regional survey efforts (Arab Barometer, Latinobarómetro, Eurobarometer, etc.). Using Afrobarometer allows for a proof of concept that could expand into other regions and ultimately result in a fully global dataset.

Merging Afrobarometer Datasets

Each “wave” of Afrobarometer occurred at a different time and with different respondents.

  • Wave 4: 20 countries in 2008
  • Wave 5: 34 countries in 2011-2013
  • Wave 6: 36 countries in 2014-2015

By combining these three waves, we get a significantly larger sample of respondents from which to calculate average educational levels across time, improving the accuracy of, and our confidence in, the estimates.

Afrobarometer Round 4

Afrobarometer numbers its countries differently each round. Therefore, I first add country names to the data so that I can convert them to Correlates of War (COW) country codes. For each round, I created a corresponding Excel document that lists the country name and its Afrobarometer value. I can then use the countrycode package to standardize the values.
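The lookup step can be sketched in base R roughly as follows. This is a minimal illustration, not the original code: the lookup values are placeholders rather than Afrobarometer's actual numbering, the object names (`lookup_r4`, `ab4_toy`) are hypothetical, and the final guarded lines assume the countrycode package is installed.

```r
# Hypothetical lookup table standing in for the per-round Excel sheet
# (Afrobarometer country value -> country name). Values are illustrative.
lookup_r4 <- data.frame(
  COUNTRY   = c(1, 2, 3),
  Statename = c("Benin", "Botswana", "Burkina Faso"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the survey data: one row per respondent.
ab4_toy <- data.frame(
  RESPNO  = c("BEN0001", "BOT0001", "BOT0002", "BFO0001"),
  COUNTRY = c(1, 2, 2, 3),
  stringsAsFactors = FALSE
)

# Attach country names; all.x = TRUE keeps respondents with unmatched codes.
ab4_named <- merge(ab4_toy, lookup_r4, by = "COUNTRY", all.x = TRUE)

# Optional: convert names to Correlates of War numeric codes.
# Skipped silently if the countrycode package is not available.
if (requireNamespace("countrycode", quietly = TRUE)) {
  ab4_named$ccode <- countrycode::countrycode(
    ab4_named$Statename, origin = "country.name", destination = "cown"
  )
}
```

Merging on a lookup table rather than hard-coding names keeps the per-round numbering differences in one place.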

The datasets are large and we need to clean our variables of interest. We’ll keep the following demographic information from the Afrobarometer Wave 4:

  • age (Q1)
  • education (Q89)
  • language (Q3)
  • country (COUNTRY)
  • survey year (DATEINTR)
  • ethnic or national identity (Q83)
  • sex (THISINT, recoded as female)
  • employment (Q94)
  • urban/rural (URBRUR)

For demonstration purposes later on, we also keep the following public opinion information for each respondent.

  • views on democracy (general)
  • extent of democracy in [Respondent Country]
  • satisfaction w/ democracy in [Respondent Country]
  • Trust in President/Prime Minister
  • Trust in Parliament
  • Trust in Ruling Party
  • Perception of ethnic group treatment by government

The following code highlights how each of the above variables is re-coded for ease of analysis and interpretation for Wave 4.

## Age
# Question Number: Q1
# Question: How old are you?
#    Variable Label: Q1. Age
# Values: 18-110, 998-999, -1
# Value Labels: 998=Refused to answer, 999=Don't know, -1=Missing 

ab4$age <- ab4$Q1
ab4$age <- as.numeric(as.character(ab4$age))
ab4$age[ab4$age == -1] <- NA
ab4$age[ab4$age == 998] <- NA
ab4$age[ab4$age == 999] <- NA
#table(ab4$age)

## Education
# Question Number: Q89
# Question: What is the highest level of education you have completed?
# Variable Label: Education of respondent
# Values: 0-9, 99, 998, -1
# Value Labels: 0=No formal schooling, 1=Informal schooling only (including Koranic schooling), 2=Some primary schooling, 3=Primary school completed, 4=Some secondary school/ high school, 5=Secondary school completed/high school completed, 6=Post-secondary qualifications, other than university e.g. a diploma or degree from polytechnic or college, 7=Some university, 8=University completed, 9=Post-graduate, 99=Don’t know, 998=Refused to answer, -1=Missing data

ab4$edu <- as.numeric(ab4$Q89)
ab4$edu[ab4$edu == -1] <- NA
ab4$edu[ab4$edu == 99] <- NA
ab4$edu[ab4$edu == 998] <- NA # 998 = Refused to answer (per codebook)
#table(ab4$edu)

ab4$primary <- ifelse(ab4$edu >= 3, 1, 0)
ab4$secondary <- ifelse(ab4$edu >= 5, 1, 0)
ab4$tertiary <- ifelse(ab4$edu >= 8, 1, 0)

## Language
# Question Number: Q3
# Question: Which [country] language is your home language?
# Variable Label: Language of Respondent
# Values: See codebook
# Value Labels: See codebook

ab4$language <- ab4$Q3

## Survey Year
# Question: Date of interview
# Variable Label: Date of interview
# Values: 04.03.08 – 31.12.08

# table(ab4$DATEINTR)
# Despite the codebook saying the values are only in 2008, the table indicates that some respondents were interviewed into 2009.
# Therefore, I'll create a new variable that takes the first 4 digits/integers of the DATEINTR variable.

ab4$year <- ab4$DATEINTR
ab4$year <- as.character(ab4$year)
ab4$year <- str_sub(ab4$year, 1, 4)
ab4$year <- as.numeric(ab4$year)
#table(ab4$year)

## Ethnic vs. National Identity
# Question Number: Q83
# Question: Let us suppose that you had to choose between being a [Ghanaian/Kenyan/etc.] and being a ________ [R’s Ethnic Group]. Which of the following best expresses your feelings?
# Variable Label: Ethnic or national identity
# Values: 1-5, 7, 9, 998, -1
# Value Labels: 1=I feel only (R’s ethnic group), 2=I feel more (R’s ethnic group) than [Ghanaian/Kenyan/etc.], 3=I feel equally [Ghanaian/Kenyan/etc.] and (R’s ethnic group), 4=I feel more [Ghanaian/Kenyan/etc.] than (R’s ethnic group), 5=I feel only [Ghanaian/Kenyan/etc.], 7=Not applicable, 9=Don’t know, 998=Refused to answer, - 1=Missing data

ab4$identity <- ab4$Q83
ab4$identity[ab4$identity == -1] <- NA
ab4$identity[ab4$identity == 7] <- NA
ab4$identity[ab4$identity == 9] <- NA
ab4$identity[ab4$identity == 998] <- NA # 998 = Refused to answer

## Urban vs. Rural
# Question Number: URBRUR
# Question: PSU/EA
# Variable Label: Urban or Rural Primary Sampling Unit
# Values: 1-2
# Value Labels: 1=urban, 2=rural
# Note: Answered by interviewer

ab4$rural <- ab4$URBRUR
ab4$rural <- ab4$rural - 1
#1: rural, 0: urban

# Sex
# Question Number: THISINT
# Question: This interview must be with a:
# Variable Label: This interview, gender
# Values: 1, 2
# Value Labels: 1=Male, 2=Female
# Note: Answered by interviewer

ab4$female <- ab4$THISINT
ab4$female <- ab4$female-1
#1: female, 0 : male

# Employed
# Question Number: Q94
# Question: Do you have a job that pays a cash income? Is it full-time or part-time? And are you presently looking for a job (even if you are presently working)?
# Variable Label: Employment status
# Values: 0-5, 9, 998, -1
# Value Labels: 0=No (not looking), 1=No (looking), 2=Yes, part time (not looking), 3=Yes, part time (looking), 4=Yes, full time (not looking), 5=Yes, full time (looking), 9=Don’t know, 998=Refused to answer, -1=Missing data Source: SAB

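ab4$employment <- ab4$Q94
ab4$employment[ab4$employment == -1] <- NA
ab4$employment[ab4$employment == 9] <- NA
ab4$employment[ab4$employment == 998] <- NA # otherwise refusals would be coded as employed below

# 1 = has a paid job (part time or full time), 0 = no paid job
ab4$employed <- ifelse(ab4$employment > 1, 1, 0)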

# View on Democracy
# Question Number: Q30
# Question: Which of these three statements is closest to your own opinion?
# Statement 1: Democracy is preferable to any other kind of government.
# Statement 2: In some circumstances, a non-democratic government can be preferable.
# Statement 3: For someone like me, it doesn’t matter what kind of government we have.
# Variable Label: Support for democracy
# Values: 1-3, 9, 998, -1
# Value Labels: 1=Statement 3: Doesn’t matter, 2=Statement 2: Sometimes non-democratic preferable, 3=Statement 1: Democracy preferable, 9=Don’t know, 998=Refused to answer, -1=Missing data

#table(ab4$Q30)
ab4$democracy <- ab4$Q30
ab4$democracy[ab4$democracy == -1] <- NA
ab4$democracy[ab4$democracy == 9] <- NA
ab4$democracy[ab4$democracy == 998] <- NA # 998 = Refused to answer
#table(ab4$democracy)

# Extent of Democracy in [Country]
# Question Number: Q42A
# Question: In your opinion how much of a democracy is [Ghana/Kenya/etc.]? today?
# Variable Label: Extent of democracy
# Values: 1-4, 8, 9, 998, -1
# Value Labels: 1=Not a democracy, 2=A democracy, with major problems, 3=A democracy, but with minor problems, 4=A full democracy, 8=Do not understand question/ do not understand what ‘democracy’ is, 9=Don’t know, 998=Refused to answer, -1=Missing data
# Source: Ghana 97

# table(ab4$Q42A)
ab4$democracyInCountry <- ab4$Q42A
ab4$democracyInCountry[ab4$democracyInCountry == -1] <- NA
ab4$democracyInCountry[ab4$democracyInCountry == 8] <- NA
ab4$democracyInCountry[ab4$democracyInCountry == 9] <- NA
ab4$democracyInCountry[ab4$democracyInCountry == 998] <- NA # 998 = Refused to answer
# table(ab4$democracyInCountry)

# Satisfied w/ Democracy in [Country]
# Question Number: Q43
# Question: Overall, how satisfied are you with the way democracy works in [Ghana/Kenya/etc.]? Are you:
# Variable Label: Satisfaction with democracy
# Values: 0-4, 9, 998, -1
# Value Labels: 0=My country is not a democracy, 1=Not at all satisfied, 2=Not very satisfied, 3=Fairly satisfied, 4=Very satisfied, 9=Don’t know, 998=Refused to answer, -1=Missing data
# Source: Eurobarometer

#table(ab4$Q43)
ab4$satisfiedDemInCountry <- ab4$Q43
ab4$satisfiedDemInCountry[ab4$satisfiedDemInCountry == -1] <- NA
ab4$satisfiedDemInCountry[ab4$satisfiedDemInCountry == 9] <- NA
ab4$satisfiedDemInCountry[ab4$satisfiedDemInCountry == 998] <- NA # 998 = Refused to answer
#table(ab4$satisfiedDemInCountry)

# Trust in President
# Question Number: Q49A
# Question: How much do you trust each of the following, or haven’t you heard enough about them to say: The President?
# Variable Label: Trust president
# Values: 0-3, 9, 998, -1
# Value Labels: 0=Not at all, 1=Just a little, 2=Somewhat, 3=A lot, 9=Don’t know/Haven’t heard enough, 998=Refused to answer, -1=Missing data
# Source: Zambia96
# Note: “Prime Minister” in Lesotho; “President” and “Prime Minister” in Burkina Faso, Cape Verde, Madagascar, Mali, Mozambique, Namibia, Senegal and Zimbabwe; “President” in Benin, Botswana, Ghana, Kenya, Liberia, Malawi, Nigeria, South Africa, Tanzania, Uganda, and Zambia.

ab4$trustPresident <- ab4$Q49A
ab4$trustPresident[ab4$trustPresident == -1] <- NA
ab4$trustPresident[ab4$trustPresident == 9] <- NA
ab4$trustPresident[ab4$trustPresident == 998] <- NA # 998 = Refused to answer
# table(ab4$trustPresident)

# Trust in Parliament
# Question Number: Q49B
# Question: How much do you trust each of the following, or haven’t you heard enough about them to say: Parliament?
# Variable Label: Trust parliament/national assembly
# Values: 0-3, 9, 998, -1
# Value Labels: 0=Not at all, 1=Just a little, 2=Somewhat, 3=A lot, 9=Don’t know/Haven’t heard enough, 998=Refused to answer, -1=Missing data
# Source: Adapted from Zambia96
# Note: “National Assembly” in Benin, Burkina Faso, Cape Verde, Liberia, Madagascar, Malawi, Mali, Mozambique, Nigeria, Tanzania, Uganda, Zambia; “Parliament” in Botswana, Ghana, Kenya, Lesotho, Namibia, Senegal, and South Africa; “House of Assembly” in Zimbabwe.

#table(ab4$Q49B)
ab4$trustParliament <- ab4$Q49B
ab4$trustParliament[ab4$trustParliament == -1] <- NA
ab4$trustParliament[ab4$trustParliament == 9] <- NA
ab4$trustParliament[ab4$trustParliament == 998] <- NA # 998 = Refused to answer
#table(ab4$trustParliament)

# Trust in Ruling Party
# Question Number: Q49E
# Question: How much do you trust each of the following, or haven’t you heard enough about them to say: The Ruling Party?
# Variable Label: Trust the ruling party
# Values: 0-3, 9, 998, -1
# Value Labels: 0=Not at all, 1=Just a little, 2=Somewhat, 3=A lot, 9=Don’t know/Haven’t heard enough, 998=Refused to answer, -1=Missing data
# Source: Adapted from Zambia96

#table(ab4$Q49E)
ab4$trustRP <- ab4$Q49E
ab4$trustRP[ab4$trustRP == -1] <- NA
ab4$trustRP[ab4$trustRP == 9] <- NA
ab4$trustRP[ab4$trustRP == 998] <- NA # 998 = Refused to answer
#table(ab4$trustRP)

# Trust Traditional Leaders
# Question Number: Q49I
# Question: How much do you trust each of the following, or haven’t you heard enough about them to say: Traditional leaders
# Variable Label: Trust traditional leaders
# Values: 0-3, 9, 998, -1
# Value Labels: 0=Not at all, 1=Just a little, 2=Somewhat, 3=A lot, 9=Don’t know/Haven’t heard enough, 998=Refused to answer, -1=Missing data
# Source: Zambia 96

#table(ab4$Q49I)
#ab4$trustTL <- ab4$Q49I
#ab4$trustTL[ab4$trustTL == -1] <- NA
#ab4$trustTL[ab4$trustTL == 7] <- NA
#ab4$trustTL[ab4$trustTL == 9] <- NA
#table(ab4$trustTL)

# Ethnic Group Treated Unfairly
# Question Number: Q82
# Question: How often are ___________s [R’s Ethnic Group] treated unfairly by the government?
# Variable Label: Ethnic group treated unfairly
# Values: 0-3, 7, 9, 998, -1
# Value Labels: 0=Never, 1=Sometimes, 2=Often, 3=Always, 7=Not applicable, 9=Don’t know, 998=Refused to answer, -1=Missing data
# Source: SAB
# Note: Interviewer probed for strength of opinion. If respondent did not identify any group on this question – that is, if they “Refused to answer” (998), said “Don’t know” (999), or “Ghanaian only” (990) – then the interviewer marked “Not applicable” for questions 80-83 and continued to question 84.

#table(ab4$Q82)
ab4$treatedUnfairly <- ab4$Q82
ab4$treatedUnfairly[ab4$treatedUnfairly == -1] <- NA
ab4$treatedUnfairly[ab4$treatedUnfairly == 7] <- NA
ab4$treatedUnfairly[ab4$treatedUnfairly == 9] <- NA
ab4$treatedUnfairly[ab4$treatedUnfairly == 998] <- NA # 998 = Refused to answer
#table(ab4$treatedUnfairly)

Now that all the variables are re-coded and cleaned, we can subset our variables of interest to make our datasets a bit smaller. We can then export the result as a “small” variant of the data for later use, if need be. Below are the first 20 observations of the new condensed Afrobarometer Wave 4 dataset.
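A minimal sketch of the subset-and-export step, using a toy data frame in place of the real survey. The object names (`ab4_toy`, `ab4_small`, `keep_vars`) and the output file name are assumptions for illustration, and only a few of the recoded variables are shown.

```r
# Toy stand-in for the recoded Wave 4 data (column names follow the text above).
ab4_toy <- data.frame(
  RESPNO   = c("BEN0001", "BEN0002"),
  Q1       = c(34, 51),   # raw survey column, no longer needed
  Q89      = c(3, 5),     # raw survey column, no longer needed
  age      = c(34, 51),
  edu      = c(3, 5),
  language = c("Fon", "Adja"),
  year     = c(2008, 2008),
  stringsAsFactors = FALSE
)

# Keep only the cleaned variables of interest.
keep_vars <- c("RESPNO", "age", "edu", "language", "year")
ab4_small <- ab4_toy[, keep_vars]

# Save the condensed data; a temporary file is used here for illustration.
out_path <- file.path(tempdir(), "ab4_small.rds")
saveRDS(ab4_small, out_path)

# Inspect the first observations.
head(ab4_small, 20)
```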

Repeat for Afrobarometer Round 5 & 6

Now I want to do the same for Round 5 and Round 6 of Afrobarometer. I do not replicate the code below, but it is otherwise identical in execution to the code for Round 4, except that certain variables are identified by different numbers in each wave.

Merging Waves 4-6

Once each wave is cleaned, recoded, and condensed, they are identical in that they each have the following variables:

  • COUNTRY
  • Statename
  • ccode
  • RESPNO
  • age
  • edu (+ primary, secondary, tertiary)
  • rural
  • female
  • employment (status)
  • employed (binary)
  • democracy
  • democracy in country
  • trust in president
  • trust in parliament
  • trust in ruling party
  • whether ethnic group is treated unfairly
  • language
  • year
  • identity

Given that the columns are also in the same order, I could rbind() them; however, the respondent ID variable (“RESPNO”) would then be duplicated, as it is just a series of numbers within each wave. I.e., Respondent #4 in Wave 4 is different from Respondent #4 in Wave 5. Therefore, the first thing I do is add the year to each RESPNO.

## [1] "BEN0001-2008" "BEN0002-2008" "BEN0003-2008" "BEN0004-2008" "BEN0005-2008"
## [6] "BEN0006-2008"
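The ID fix and the vertical merge can be sketched as follows. This is an illustration on toy data with hypothetical object names (`ab4_small`, `ab5_small`, `ab_all`); it assumes the year is appended with a hyphen, as in the output above.

```r
# Toy stand-ins for two cleaned waves with identical columns.
ab4_small <- data.frame(
  RESPNO = c("BEN0001", "BEN0002"), year = c(2008, 2008), edu = c(3, 5),
  stringsAsFactors = FALSE
)
ab5_small <- data.frame(
  RESPNO = c("BEN0001", "BEN0002"), year = c(2011, 2011), edu = c(2, 4),
  stringsAsFactors = FALSE
)

# Suffix each respondent ID with the survey year so IDs stay unique across waves.
ab4_small$RESPNO <- paste(ab4_small$RESPNO, ab4_small$year, sep = "-")
ab5_small$RESPNO <- paste(ab5_small$RESPNO, ab5_small$year, sep = "-")

# Stack the waves once the column names are confirmed to match.
stopifnot(identical(names(ab4_small), names(ab5_small)))
ab_all <- rbind(ab4_small, ab5_small)

# 0 means no respondent ID appears twice.
anyDuplicated(ab_all$RESPNO)
```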

Linking Ethnic Data with EPR

Now, I want to use the Linking Ethnic Data from Africa (LEDA) package to treat the language of each respondent as an indicator of ethnicity, which I can then link to other datasets (such as EPR). The Ethnic Power Relations dataset includes country-year information on ethnic groups and their relative political status (Monopoly, Dominant, Senior Partner, Junior Partner, Powerless, Discriminated, Irrelevant).

LEDA() lets me produce a dataset that includes the language name from Afrobarometer and its corresponding ethnic group from EPR. Here’s an example.

##     a.group                              b.group        a.type b.type
## 39     Adja                  Southwestern (Adja) Afrobarometer    EPR
## 97     Adja                  Southwestern (Adja) Afrobarometer    EPR
## 166    Adja Southeastern (Yoruba/Nagot and Goun) Afrobarometer    EPR
## 213    Adja                  Southwestern (Adja) Afrobarometer    EPR
## 282    Adja Southeastern (Yoruba/Nagot and Goun) Afrobarometer    EPR
## 329    Adja                  Southwestern (Adja) Afrobarometer    EPR

Next, I need to load in the corresponding “Language ID Number” and “Language Name” from Afrobarometer. The following Excel documents were created using the codebooks from Afrobarometer. In short, I copied and pasted the delimited list of ID=language pairs from the codebooks and used Excel to automatically split them into individual rows/columns.

Let’s run a few checks to see how well the language names from the codebooks and LEDA() match. Below, I have R output the differences between the two for each wave.

##  [1] "Fuls"                                    
##  [2] "Moore"                                   
##  [3] "Senoufo"                                 
##  [4] "Arabe"                                   
##  [5] "Khassonke"                               
##  [6] "Malinke"                                 
##  [7] "Soninke/ Sarakoll"                       
##  [8] "Sonrhai"                                 
##  [9] "Mang'anja"                               
## [10] "Oshiwambo"                               
## [11] "Ijaw/Kalabari/Okirika/Andoni/Ogoni/Nembe"
##  [1] "Moore"                               "Senoufo"                            
##  [3] "Baoule"                              "Bete"                               
##  [5] "Godie"                               "Guere"                              
##  [7] "Diakanke"                            "konianke"                           
##  [9] "Maasai / Samburu"                    "Meru / Embu"                        
## [11] "\"Official\" Malagasy"               "Khassonke"                          
## [13] "Malinke"                             "Peulh / Fulfude"                    
## [15] "Soninke / Sarakolle"                 "Sonrhai"                            
## [17] "Chimang'anja"                        "Oshiwambo (Oshindonga/Oshikwanyama)"
## [19] "Beri beri"                           "Zarrma/Songhai"                     
## [21] "Kabye"
##  [1] "Moore"                               "Senoufo"                            
##  [3] "Baoule"                              "Bete"                               
##  [5] "Godie"                               "Guere"                              
##  [7] "Bangangte"                           "Foufoulde"                          
##  [9] "Mbede"                               "Myene"                              
## [11] "Nzebi/Metie"                         "Punu/Merie"                         
## [13] "Malgache << officiel >>"             "Malgache avec specificite regionale"
## [15] "Khassonke"                           "Malinke"                            
## [17] "Soninke/Sarakole"                    "Portuguese"                         
## [19] "Zarma/Songhai"

So there is some mismatch, but not much, across each list. I therefore fix the spelling of any languages that are obvious matches, i.e., those with just a one-letter difference (which I corroborated online to be an alternative spelling) or a difference in accent marks (which LEDA() does not include in any spelling). I make these fixes in my Excel documents so that the language names match the spellings used by LEDA().

## [1] "Arabe"                                   
## [2] "Senufo/ Mianka"                          
## [3] "Soninke/ Sarakoll"                       
## [4] "Ijaw/Kalabari/Okirika/Andoni/Ogoni/Nembe"
## [1] "\"Official\" Malagasy" "Senufo"
## [1] "Senufo"

These mismatches will remain, as no links seem to be available.
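The check-and-fix cycle above can be sketched with `setdiff()`. The language spellings below are invented for illustration and do not reflect the actual codebook or LEDA entries; the object names are hypothetical.

```r
# Invented language names: one vector from the codebook sheet,
# one from the link table. Spellings are illustrative only.
codebook_langs <- c("Lang A", "Lang B", "Langue C", "Lang Dee")
link_langs     <- c("Lang A", "Lang B", "Lang C", "Lang D")

# Names in the codebook with no exact match in the link table.
unmatched <- setdiff(codebook_langs, link_langs)
unmatched

# After confirming that these are alternative spellings, recode the
# codebook names to the spelling used in the link table.
spelling_fixes <- c("Langue C" = "Lang C", "Lang Dee" = "Lang D")
needs_fix <- codebook_langs %in% names(spelling_fixes)
codebook_langs[needs_fix] <- spelling_fixes[codebook_langs[needs_fix]]

# Whatever is left here has no available link.
remaining <- setdiff(codebook_langs, link_langs)
remaining
```

Keeping the fixes in a named vector documents every spelling change in one place, which is easier to audit than edits made directly in a spreadsheet.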

Individual Datasets

So at this point I have three datasets for each Afrobarometer round:

  • ab# - the Afrobarometer Dataset that contains the survey information.
  • lang.r# - the language number and corresponding language name from each Afrobarometer round
  • link.ab# - the Afrobarometer language name and corresponding EPR name for each round

Let’s quickly take a look at each.

Now I want to merge EPR with the link.ab# list.

Loading EPR

There are several iterations of aggregating EPR. Let’s first look at the structure of the data. As you can see, it is a condensed year format, so we need to expand and subset it.
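A minimal sketch of expanding the condensed format into one row per group-year. The column names (`statename`, `group`, `status`, `from`, `to`) are assumed here, and the rows are toy values rather than actual EPR records.

```r
# Toy rows in a condensed layout: one row per group per period.
epr_toy <- data.frame(
  statename = c("Country A", "Country A"),
  group     = c("Group 1", "Group 1"),
  status    = c("POWERLESS", "JUNIOR PARTNER"),
  from      = c(1969, 1991),
  to        = c(1990, 2015),
  stringsAsFactors = FALSE
)

# Expand a single period (row i) into one row per year.
expand_period <- function(i) {
  row <- epr_toy[i, ]
  data.frame(
    statename = row$statename,
    group     = row$group,
    status    = row$status,
    year      = seq(row$from, row$to),
    stringsAsFactors = FALSE
  )
}
epr_years <- do.call(rbind, lapply(seq_len(nrow(epr_toy)), expand_period))

# Subset to the years covered by the EGE.
epr_years <- epr_years[epr_years$year >= 1969 & epr_years$year <= 2015, ]
```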

Given how I am linking languages using LEDA(), some languages in Afrobarometer may correspond to multiple ethnic groups in the same country, and each of those ethnic groups may have a different status. Therefore, I adopt the following coding rules.

In the picture below, you can see that the language “Adja” (from the Afrobarometer Wave 4 language response) corresponds to two ethnic groups in Benin, one in the Southwest and one in the Southeast. The same applies to the language “Goun”. In this case, regardless of the language-group pairing, the “status” does not change. In other words, all individuals who speak “Adja” (in the SW or SE) are “Junior Partners” and all individuals who speak “Goun” are “Junior Partners”. In such cases, I simply collapse the observations (or remove 1 from each, so that I don’t get duplicates when merging).

Example 1 - Excel Screencap

The next example gives us two other possibilities. In the first case (light green), the language “Akan” is linked to two ethnic groups in Ghana: the “Asante (Akan)” and the “Other Akans”. The Asante are “Senior Partner” and the Other Akans are “Junior Partner”. In such cases, I prioritize whether or not the groups have power (i.e., I don’t intend to differentiate between senior and junior partners). Therefore, I delete the “Other Akans” group.

Alternatively, the language “Ijaw/Kalabari/Okirika/Andoni/Ogoni/Nembe” (dark green) in Nigeria applies to two ethnic groups, the Ijaw and the Ogoni; however, the Ijaw are “Junior Partner” and the Ogoni are “Powerless”. Therefore, I delete both, as I am unable to differentiate them in the analysis.

Example 2 - Excel Screencap

By and large, the majority of countries had no duplicates. Of those that did have duplicates, it was only 1-2 groups. The exception was Namibia - in which nearly all languages coincided with multiple ethnic groups that ultimately had different statuses. I still followed the above rules.

Example 3 - Excel Screencap

At this point, I’ve fully merged the Afrobarometer Wave 4 dataset with EPR, connecting individuals to ethnic groups, as defined in EPR, based upon their language. Let’s take a look at how many individuals in Afrobarometer Wave 4 now have corresponding information in EPR.

## [1] 27713
## [1] 8481
## [1] 19232
## [1] 0.6939703
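The four numbers above appear to be the total number of respondents, those without a linked EPR group, those with one, and the matched share (19232 / 27713). A check of this kind can be sketched as follows, using a toy data frame and a hypothetical column name (`epr_group`).

```r
# Toy merged data: epr_group is NA when no EPR group could be linked.
ab4_epr <- data.frame(
  RESPNO    = sprintf("R%02d", 1:10),
  epr_group = c("G1", "G1", NA, "G2", "G2", "G2", NA, "G1", NA, "G2"),
  stringsAsFactors = FALSE
)

n_total   <- nrow(ab4_epr)                   # all respondents
n_missing <- sum(is.na(ab4_epr$epr_group))   # no linked EPR group
n_matched <- n_total - n_missing             # linked to an EPR group
share_matched <- n_matched / n_total

c(n_total, n_missing, n_matched, share_matched)
```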

Nearly 70% of respondents in Afrobarometer Wave 4 have a corresponding EPR group. Information for the missing 30% likely exists, but I had to drop it because their languages correspond to ethnic groups with conflicting statuses. I do hope to return to this in the future, but for now I move forward as a proof of concept.

At this point, I want to repeat the above steps (beginning with loading EPR) two more times: once for Afrobarometer Wave 5 (2011) and once for Afrobarometer Wave 6 (2015).

Repeat for AB5 and AB6

I repeat the above steps (beginning with “Loading EPR”) for AB5 and AB6, but I do not replicate it below.

As with above, there are similar problems where 1 language coincides with multiple ethnic groups. This is particularly true in North Africa, as shown below.

In the case of Morocco, Arabic coincides with two ethnic groups: Arabs and Sahrawis. Arabs are dominant, and Sahrawis are discriminated. However, Arabic-speaking Sahrawis make up only .016 of the population. Similar issues arise with Arabic being the sole language in Sudan and Egypt.

Example 4 - Excel Screencap

Therefore, the only change I make is this: if multiple ethnic groups map to a single language, and one of those groups makes up less than .1 of the population (often much less) and is powerless, I delete that group in favor of the ethnic group that is much larger and has power. This better captures scenarios where very small marginalized ethnic groups (who likely are not even picked up by Afrobarometer surveys) speak the same language as larger empowered groups. Therefore, in the above, I delete the information from rows 277 and 278.
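The coding rules described above were applied by hand in Excel; the sketch below is one possible way to express them in R, not the original procedure. The link table, group names, and sizes are hypothetical, the helper `resolve_language()` is invented for illustration, and keeping the first row when statuses agree is an assumption.

```r
# EPR statuses treated as included in power.
included_statuses <- c("MONOPOLY", "DOMINANT", "SENIOR PARTNER", "JUNIOR PARTNER")

# Hypothetical link table: each language maps to two EPR groups.
links <- data.frame(
  language = rep(c("lang_same", "lang_power", "lang_conflict", "lang_tiny"), each = 2),
  group    = paste0("group_", 1:8),
  status   = c("JUNIOR PARTNER", "JUNIOR PARTNER",   # same status
               "SENIOR PARTNER", "JUNIOR PARTNER",   # both included
               "JUNIOR PARTNER", "POWERLESS",        # conflicting
               "DOMINANT", "DISCRIMINATED"),         # tiny excluded group
  size     = c(0.20, 0.15, 0.15, 0.30, 0.25, 0.20, 0.90, 0.016),
  stringsAsFactors = FALSE
)

# Resolve all rows that share one language. Set small = 0 to disable
# the small-group rule.
resolve_language <- function(d, small = 0.1) {
  included <- d$status %in% included_statuses
  # Small-group rule: drop tiny excluded groups that share a language
  # with an included group.
  if (any(included)) {
    keep <- included | d$size >= small
    d <- d[keep, , drop = FALSE]
    included <- included[keep]
  }
  # Conflict rule: included and excluded groups still share the
  # language, so none of the rows can be used.
  if (length(unique(included)) > 1) return(d[0, , drop = FALSE])
  # Otherwise keep a single row so the later merge creates no duplicates.
  d[1, , drop = FALSE]
}

resolved <- do.call(rbind, lapply(split(links, links$language), resolve_language))
rownames(resolved) <- NULL
resolved
```

In this toy example, `lang_conflict` is dropped entirely, while the other three languages each keep a single linked group.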

Vertical Merge

At this point in time, I have the Afrobarometer Wave 4-6 surveys merged with EPR. For roughly 70% of all respondents, I have their corresponding Ethnic Group Status information (which is not included in Afrobarometer).

To note, the “ab_all_final.Rda” dataset is the individual respondent information from Afrobarometer that can be used to tackle Question 1.

Please see the “clott_egd_q1_rmark.Rmd” file for a preliminary data analysis for Question 1.

Backtracking Age Profiles

Let’s try to create an aggregate now. So “aball” contains three waves of Afrobarometer:

  • Afrobarometer Wave 4 - Conducted in 2008
  • Afrobarometer Wave 5 - Conducted from 2011-2013
  • Afrobarometer Wave 6 - Conducted from 2014-2015

Respondents were therefore surveyed at different times, so their ages are not standardized, and I need to fix this before backtracking age-cohort profiles. I increase everyone’s age according to when the survey was conducted. For instance, a 35-year-old surveyed in 2008 would (presumably, if alive) be 42 in 2015. Of course, this creates issues if someone was already elderly in 2008 (say, ages 85+); however, we can keep them in the survey since we are assuming all education would be attained at a younger age. Keeping their observations helps bolster our estimates for earlier years. Furthermore, I’m only standardizing to 2015, the last year for which we have information.
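The age standardization amounts to adding the gap between the survey year and the reference year. A minimal sketch, with hypothetical names (`aball_toy`, `age2015`) and toy values:

```r
# Toy respondents surveyed in different years.
aball_toy <- data.frame(
  age  = c(30, 18, 85),
  year = c(2008, 2012, 2015)
)

# Age each respondent forward to the common reference year.
reference_year <- 2015
aball_toy$age2015 <- aball_toy$age + (reference_year - aball_toy$year)

aball_toy$age2015  # 37 21 85
```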

From this information, my preliminary approach to estimating educational attainment per country and ethnic group is the following:

  • Create a for loop in which each iteration aggregates the average educational attainment per ethnic group for one year, limiting the respondent sample by age as I backtrack through time.
  • Each iteration removes respondents based on their age, with each iteration corresponding to one year.
  • For instance, the first iteration takes the average educational attainment of each ethnic group using all respondents (ages 18+). This gives us the educational attainment for 2015.
  • The second iteration takes the average educational attainment of each ethnic group using all respondents ages 19+. This gives us the educational attainment for 2014.
  • The third iteration takes the average educational attainment of each ethnic group using all respondents ages 20+. This gives us the educational attainment for 2013.
  • Repeat until I have information through 1969.

This method of backtracking group-year information is based upon the assumption that education (at least primary and secondary) will be completed in the first 18 years of respondents’ lives, on average. Tertiary education will still be captured by individuals older than 18.
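The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the original code: the function name `backtrack_edu()`, the column names (`statename`, `group`, `age2015`, `edu`), and the toy respondents are all hypothetical.

```r
# Toy respondents with ages already standardized to 2015.
resp <- data.frame(
  statename = "Country A",
  group     = c("Group 1", "Group 1", "Group 1", "Group 2", "Group 2"),
  age2015   = c(20, 40, 60, 25, 58),
  edu       = c(5, 3, 1, 6, 2),
  stringsAsFactors = FALSE
)

backtrack_edu <- function(resp, first_year = 1969, last_year = 2015, min_age = 18) {
  results <- list()
  for (yr in last_year:first_year) {
    # Respondents must have been at least min_age in year yr.
    cutoff <- min_age + (last_year - yr)
    keep <- resp[!is.na(resp$age2015) & !is.na(resp$edu) &
                   resp$age2015 >= cutoff, ]
    if (nrow(keep) == 0) next  # nobody old enough for this year
    agg <- aggregate(edu ~ statename + group, data = keep, FUN = mean)
    agg$year <- yr
    results[[as.character(yr)]] <- agg
  }
  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

ege_toy <- backtrack_edu(resp)
head(ege_toy)
```

Note that group-years with no sufficiently old respondents simply drop out of the result, so estimates for the earliest years rest on fewer and older respondents.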

And voila! We have a dataset that has ethnic group education level per country-year. Let’s melt it and add in the EPR information again. The trick is that the EPR information (whether a group was discriminated, powerless, in power, etc.) changes historically. So we actually want to merge this new dataset with the old EPR dataset. That way, we can say that “X Group was Discriminated in 1975, and their education level was Y.”
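A minimal sketch of this merge, assuming both tables are already in long (one row per group-year) form and share the key columns `statename`, `group`, and `year`. The object names and values are toy placeholders.

```r
# Toy group-year education estimates.
ege_long <- data.frame(
  statename = "Country A",
  group     = "Group 1",
  year      = c(1975, 2015),
  edu       = c(1.2, 3.4),
  stringsAsFactors = FALSE
)

# Toy group-year EPR statuses.
epr_long <- data.frame(
  statename = "Country A",
  group     = "Group 1",
  year      = c(1975, 2015),
  status    = c("DISCRIMINATED", "JUNIOR PARTNER"),
  stringsAsFactors = FALSE
)

# Left join: keep every education estimate, attach the status for that year.
ege_epr <- merge(ege_long, epr_long,
                 by = c("statename", "group", "year"), all.x = TRUE)
ege_epr
```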

Then we can look at a couple figures of what we have.

##   statename                                group year      Edu
## 1   Algeria                                Arabs 2015 3.195989
## 2   Algeria                              Berbers 2015 2.436170
## 3     Benin                  South/Central (Fon) 2015 2.376828
## 4     Benin Southeastern (Yoruba/Nagot and Goun) 2015 2.083770
## 5     Benin                  Southwestern (Adja) 2015 1.954301
## 6  Botswana                                Birwa 2015 3.659574

Example Figures & Analyses

The previous section (“Code & Creation of EGE”) leaves us with two novel datasets which we can explore further.

  1. A regional Ethnic Group Education dataset (EGE) for Africa. This dataset includes all ethnic group educational attainment rates from 1969-2015.
  2. An Afrobarometer Survey (Waves 4-6, 2008-2016) that has been merged with Ethnic Power Relations data. In other words, we can connect individuals to their ethnic groups along with information on said groups that was not originally included in Afrobarometer.

Ethnic Group Education (EGE)

The preliminary EGE contains ethnic group educational attainment rates from 1969-2015 in the following countries:

##  [1] "Algeria"       "Benin"         "Botswana"      "Burkina Faso" 
##  [5] "Burundi"       "Cameroon"      "Cote d’Ivoire" "Egypt"        
##  [9] "Ghana"         "Guinea"        "Kenya"         "Lesotho"      
## [13] "Liberia"       "Madagascar"    "Malawi"        "Mali"         
## [17] "Mauritius"     "Morocco"       "Mozambique"    "Namibia"      
## [21] "Niger"         "Nigeria"       "Senegal"       "Sierra Leone" 
## [25] "South Africa"  "Swaziland"     "Tanzania"      "Tunisia"      
## [29] "Uganda"        "Zambia"        "Zimbabwe"

For each year, it also includes the Ethnic Power Relations (EPR) ethnic group status information (whether the group was a monopoly, dominant, senior partner, junior partner, powerless, discriminated, or irrelevant). I’ve recoded that status as “in power” or “not in power”, as defined by EPR. Groups that are monopoly, dominant, senior partner, or junior partner are considered to be included in politics, whereas groups that are discriminated or powerless are considered to be excluded from politics.
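The recoding just described amounts to a simple lookup. The Python sketch below is only an illustration: the function name `in_power` is hypothetical, and returning `None` for “irrelevant” groups is my assumption, since the description above does not place them on either side.

```python
INCLUDED = {"MONOPOLY", "DOMINANT", "SENIOR PARTNER", "JUNIOR PARTNER"}
EXCLUDED = {"DISCRIMINATED", "POWERLESS"}


def in_power(status):
    """Return True for included groups, False for excluded ones, else None."""
    if status is None:
        return None
    label = status.strip().upper()
    if label in INCLUDED:
        return True
    if label in EXCLUDED:
        return False
    # Anything else (e.g. IRRELEVANT) is left uncoded in this sketch.
    return None
```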

Here is a look at the first 100 observations of the dataset.

With this information, we can see the following trends:

[Figure: Educational Attainment]

Alternatively, we can look at specific countries of interest.

[Figure: Educational Attainment in Namibia]

Education, Ethnic Group Identity, and Political Attitudes

A core assumption of my theory is that education affects marginalized communities differently than advantaged communities. I argue that education will foster comparatively stronger pro-democratic attitudes among marginalized groups, as democracy is a means to inclusion. Among advantaged groups, by contrast, education will foster state-supporting (pro-authoritarian) views that protect the status quo.

I have another working paper that uses the individual-level survey data from Afrobarometer to predict when individuals are more likely to identify with the state or with their ethnic group. I argue that education will lead marginalized individuals to be more likely to identify with their ethnic group, and advantaged individuals to be more likely to identify with their state.

Now that I have the merged Afrobarometer data that also includes EPR information, I can provide a hierarchical analysis of political attitudes dependent upon individuals’ membership in excluded/included ethnic groups.

[Figure: Trust in Ruling Party]

[Figure: Worldwide Academic Freedom]

[Figure: National and Ethnic Identity Attachment]

##   COUNTRY Statename ccode       RESPNO age edu primary secondary tertiary
## 1       1     Benin   434 BEN0001-2008  38   4       1         0        0
## 2       1     Benin   434 BEN0002-2008  46   2       0         0        0
## 3       1     Benin   434 BEN0003-2008  28   4       1         0        0
## 4       1     Benin   434 BEN0004-2008  30   3       1         0        0
## 5       1     Benin   434 BEN0005-2008  23   4       1         0        0
## 6       1     Benin   434 BEN0006-2008  24   4       1         0        0
##   language year identity rural female employment employed democracy
## 1      100 2008        2     0      1          0        0         3
## 2      104 2008        4     0      0          1        0         3
## 3      101 2008       NA     0      1          2        1         3
## 4      100 2008        5     0      0          1        0         2
## 5      100 2008        5     0      1          1        0         3
## 6      100 2008        5     0      0          1        0         2
##   democracyInCountry satisfiedDemInCountry trustPresident trustParliament
## 1                  4                     4              3               1
## 2                  4                     4              1               1
## 3                  4                     2              3               2
## 4                  3                     2              1               1
## 5                  2                     2              3               3
## 6                  3                     3              1               2
##   trustRP treatedUnfairly languageName                                group
## 1       1               0          Fon                  South/Central (Fon)
## 2       0               0       Yoruba Southeastern (Yoruba/Nagot and Goun)
## 3       2              NA         Adja                  Southwestern (Adja)
## 4       0               0          Fon                  South/Central (Fon)
## 5       1               0          Fon                  South/Central (Fon)
## 6       2               1          Fon                  South/Central (Fon)
##    size         status timeDiff ageUpdate
## 1 0.330 JUNIOR PARTNER        7        45
## 2 0.185 JUNIOR PARTNER        7        53
## 3 0.150 JUNIOR PARTNER        7        35
## 4 0.330 JUNIOR PARTNER        7        37
## 5 0.330 JUNIOR PARTNER        7        30
## 6 0.330 JUNIOR PARTNER        7        31

Looking Ahead

There are a handful of obstacles yet to overcome:

  • Across all respondents in Afrobarometer waves 4-6, only 70% are currently matched with EPR. This is likely due to my having to drop instances where one language matches two or more incompatible ethnic groups.
  • This is only information on Africa (and 35 countries therein). It would be ideal to find a way to do this globally, potentially with information that could apply cross-country. I have explored doing so with sources like the World Values Survey or the Demographic and Health Surveys; however, I would then need to find a different way to link individuals’ ethnic groups with a dataset like EPR.
  • I don’t currently have confidence intervals for each estimate. It would be good to provide upper- and lower-bound estimates for each education estimate, dependent upon the number of respondents that are aggregated in that estimate. Alternatively, I am also considering multi-level regression with poststratification, where I can combine information from multiple datasets to create stronger estimates.
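One simple way to attach bounds that depend on the number of respondents would be a normal-approximation interval around each group-year mean. The Python sketch below is only an illustration of that idea, not something already implemented for the EGE: the function name `mean_with_bounds` and the example values are made up, and because the education measure is an ordinal scale, the approximation is rough, especially for small groups.

```python
import math
import statistics


def mean_with_bounds(values, z=1.96):
    """Return (mean, lower, upper) for an approximate 95% interval.

    The standard error shrinks with the square root of the number of
    respondents, so estimates built on few respondents get wide bounds.
    """
    n = len(values)
    mean = statistics.fmean(values)
    if n < 2:
        return mean, None, None  # no spread can be estimated from one respondent
    se = statistics.stdev(values) / math.sqrt(n)
    return mean, mean - z * se, mean + z * se


# Made-up education codes for one group-year, then the same pattern with 4x the respondents.
few = [2, 4, 4, 2]
many = few * 4
mean_few, lo_few, hi_few = mean_with_bounds(few)
mean_many, lo_many, hi_many = mean_with_bounds(many)
```

With the same spread of answers, the interval for `many` is narrower than the one for `few`, which is the behavior the upper- and lower-bound estimates would need for the early years, where few respondents remain.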